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Abstract 

Sparse PCA provides a linear combination of small number of features that maxi- 
mizes variance across data. Although Sparse PCA has apparent advantages com- 
pared to PCA, such as better interpretability, it is generally thought to be compu- 
tationally much more expensive. In this paper, we demonstrate the surprising fact 
that sparse PCA can be easier than PCA in practice, and that it can be reliably 
applied to very large data sets. This comes from a rigorous feature elimination 
pre-processing result, coupled with the favorable fact that features in real-life data 
typically have exponentially decreasing variances, which allows for many features 
to be eliminated. We introduce a fast block coordinate ascent algorithm with much 
better computational complexity than the existing first-order ones. We provide ex- 
perimental results obtained on text corpora involving millions of documents and 
hundreds of thousands of features. These results illustrate how Sparse PCA can 
help organize a large corpus of text data in a user-interpretable way, providing an 
attractive alternative approach to topic models. 



1 Introduction 

The sparse Principal Component Analysis (Sparse PCA) problem is a variant of the classical PCA 
problem, which accomplishes a trade-off between the explained variance along a normalized vector, 
and the number of non-zero components of that vector. 

Sparse PCA not only brings better interpretation [1 1, but also provides statistical regularization O 
when the number of samples is less than the number of features. Various researchers have proposed 
different formulations and algorithms for this problem, ranging from ad-hoc methods such as factor 
rotation techniques [3| and simple thresholding [4], to greedy algorithms J5j|6l. Other algorithms 
include SCoTLASS by Q, SPCA by 0, the regularized SVD method by E) and the generalized 
power method by flOl . These algorithms are based on non-convex formulations, and may only 
converge to a local optimum. The ^i-norm based semidefinite relaxation DSPCA, as introduced 
in HI, does guarantee global convergence and as such, is an attractive alternative to local methods. 
In fact, it has been shown in jjT] |2] [TT| that simple ad-hoc methods, and the greedy, SCoTLASS 
and SPCA algorithms, often underperform DSPCA. However, the first-order algorithm for solving 
DSPCA, as developed in [1], has a computational complexity of 0(n 4 -\/log n), with n the number of 
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features, which is too high for many large-scale data sets. At first glance, this complexity estimate 
indicates that solving sparse PCA is much more expensive than PCA, since we can compute one 
principal component with a complexity of 0(n 2 ). 

In this paper we show that solving DSPCA is in fact computationally easier than PCA, and hence can 
be applied to very large-scale data sets. To achieve that, we first view DSPCA as an approximation 
to a harder, cardinality-constrained optimization problem. Based on that formulation, we describe 
a safe feature elimination method for that problem, which leads to an often important reduction in 
problem size, prior to solving the problem. Then we develop a block coordinate ascent algorithm, 
with a computational complexity of 0(n 3 ) to solve DSPCA, which is much faster than the first- 
order algorithm proposed in [ 1 1. Finally, we observe that real data sets typically allow for a dramatic 
reduction in problem size as afforded by our safe feature elimination result. Now the comparison 
between sparse PCA and PCA becomes 0(n 3 ) v.s. 0(n 2 ) with n <C n, which can make sparse 
PCA surprisingly easier than PCA. 

In Section 2, we review the £i-norm based DSPCA formulation, and relate it to an approximation to 
the £o- norm based formulation and highlight the safe feature elimination mechanism as a powerful 
pre-processing technique. We use Section 3 to present our fast block coordinate ascent algorithm. 
Finally, in Section 4, we demonstrate the efficiency of our approach on two large data sets, each one 
containing more than 100,000 features. 

Notation. 71{Y) denotes the range of matrix Y, and its pseudo-inverse. The notation log refers 
to the extended-value function, with logs = — oo if x < 0. 

2 Safe Feature Elimination 

Primal problem. Given a n x n positive-semidefinite matrix E, the "sparse PCA" problem intro- 
duced in |[T| is : 

= max TrEZ- A||Z||i : Z>Q, TrZ=l (1) 
where A > is a parameter encouraging sparsity. Without loss of generality we may assume that 

Problem ([T]i is in fact a relaxation to a PCA problem with a penalty on the cardinality of the variable: 

ifj = max x T T,x — A||x||o : ||a;||2 = 1 (2) 

X 

Where ||x||o denotes the cardinality (number of non-zero elemements) in x. This can be seen by first 
writing problem Q as: 

max Tr EZ - \\J\\Z\\ Q : Z h 0, Tr Z = 1, Rank(Z) = 1 

where \\Z\\q is the cardinality (number of non-zero elements) of Z. Since \\Z\\i < y/\\Z\\o \\Z\\p = 
y/\\Z\\o, we obtain the relaxation 

max TrEZ - A||Z||i : Z y 0, Tr Z = 1, Rank(Z) = 1 
Further drop the rank constraint, leading to problem ([TJ. 

By viewing problem ([T]i as a convex approximation to the non-convex problem we can leverage 
the safe feature elimination theorem first presented in [6, 12 1 for problem (|2]): 

Theorem 2.1 Let E = A T A, where A = (a u . . . , o„) € R mxn . We have 

n 

V^= max V((af0 2 -^)+- 

2 — 1 

An optimal non-zero pattern corresponds to indices i with A < (aft;) 2 at optimum. 

We observe that the i-th feature is absent at optimum if (af £) 2 < A for every £, ||£||2 = 1. Hence, 
we can safely remove feature i € {1, . . . , n} if 

Eii = ajdi < A (3) 
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A few remarks are in order. First, if we are interested in solving problem (Tfjl as a relaxation to prob- 
lem pi, we first calculate and rank all the feature variances, which takes O(nm) and 0(nlog(n)) 
respectively. Then we can safely eliminate any feature with variance less than A. Second, the 
elimination criterion above is conservative. However, when looking for extremely sparse solutions, 
applying this safe feature elimination test with a large A can dramatically reduce problem size and 
lead to huge computational savings, as will be demonstrated empirically in Section 4. Third, in 
practice, when PCA is performed on large data sets, some similar variance-based criteria is rou- 
tinely employed to bring problem sizes down to a manageable level. This purely heuristic practice 
has a rigorous interpretation in the context of sparse PCA, as the above theorem states explicitly the 
features that can be safely discarded. 



3 Block Coordinate Ascent Algorithm 

The first-order algorithm developed in IT) to solve problem (fill has a computational complexity 
of 0{n i ^\ogn). With a theoretical convergence rate of 0(j), the DSPCA algorithm does not 
converge fast in practice. In this section, we develop a block coordinate ascent algorithm with better 
dependence on problem size (0(n 3 )), that in practice converges much faster. 



Failure of a direct method. We seek to apply a "row -by-row" algorithm by which we update each 
row/column pair, one at a time. This algorithm appeared in the specific context of sparse covariance 
estimation in iTPJl . and extended to a large class of SDPs in [14|. Precisely, it applies to problems of 
the form 

min f(X) ~ /31ogdetX : L < X < U, X y 0, (4) 

X 

where X — X T is a n x n matrix variable, L, U impose component-wise bounds on X, f is convex, 
and P > 0. 

However, if we try to update the row/columns of Z in problem (|TJ, the trace constraint will imply that 
we never modify the diagonal elements of Z. Indeed at each step, we update only one diagonal ele- 
ment, and it is entirely fixed given all the other diagonal elements. The row-by-row algorithm does 
not directly work in that case, nor in general for SDPs with equality constraints. The authors in lfT4l 
propose an augmented Lagrangian method to deal with such constraints, with a complication due to 
the choice of appropriate penalty parameters. In our case, we can apply a technique resembling the 
augmented Lagrangian technique, without this added complication. This is due to the homogeneous 
nature of t he objective function and of the conic constraint. Thanks to the feature elimination result 



(Thm. 2.1 1, we can always assume without loss of generality that A < cr^ n := rnirii<j<„ E 



Direct augmented Lagrangian technique. We can express problem (|TJ as 

\cj) 2 = max TrEX - X\\X\\t - hlrX) 2 : X h 0. (5) 

This expression results from the change of variable X = jZ, with Tr Z = 1, and 7 > 0. Optimizing 
over 7 > 0, and exploiting <fi > (which comes from our assumption that A < <r? nin ), leads to the 
result, with the optimal scaling factor 7 equal to cf>. An optimal solution Z* to ([TJ can be obtained 
from an optimal solution X* to the above, via Z* = X*/<j>. (In fact, we have Z* = X* / Tr(X*).) 

To apply the row-by-row method to the above problem, we need to consider a variant of it, with a 
strictly convex objective. That is, we address the problem 

max TrEX- A||X||i - i(TrX) 2 + /SlogdetX, : X >- 0, (6) 
x z 

where (3 > is a penalty parameter. SDP theory ensures that if j3 = e/n, then a solution to the 
above problem is e-suboptimal for the original problem lfl5ll . 



Optimizing over one row/column. Without loss of generality, we consider the problem of updat- 
ing the last row/column of the matrix variable X. Partition the latter and the covariance matrix S 
as 

X 



( y T 


:) 
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where Y,S<E rC"" 1 )*^- 1 ), y> s e R"" 1 , and x, cr e R. We are considering the problem above, 
where Y is fixed, and (j/, a;) e R n is the variable. We use the notation < := Tr Y. 

The conic constraint X >- translates as y T Y'y < x, y E 1Z{Y), where 1Z(Y) is the range of the 
matrix Y, We obtain the sub-problem 

tf-max f 2( 2/ T S -A|| 2 /|| 1 ) + ( ( T-A) a ;-i(i + a; ) 2 \ m 
' • :na " 1, +/31og( a; - 2 ; T yt y ) ).yZK(Y). (7) 

Simplifying the sub-problem. We can simplify the above problem, in particular, avoid the step of 
forming the pseudo-inverse of Y, by taking the dual of problem 0. 

Using the conjugate relation, valid for every r\ > 0: 

log rj + 1 = min z?7 — log z, 

2>0 

and with f(x) := (cr — X)x — i(t + x) 2 , we obtain 

ib + /3 — max 2(y T s — A||y||i) + f(x) + /3min (z(x — y T Y^y) — logz) 

= min max 2(y T s — A||y||i — f3zy T Y ' y) + max (/(a;) + (3zx) — (3 log 2 
= min h(z) + 2g(z) 

where, for z > 0, we define 

h(z) := — (3 log z + max (f(x) + f3zx) 

X 

= -Plogz + max ((cr - A + /3z)x - \{t + x) 2 ) 

= —\t 2 — /31ogz + max ((cr — A — t + f3z)x — \x 2 ) 
2 x 2 

= -^ 2 -/31ogz+i(cr-A-< + /3z) 2 

with the following relationship at optimum: 

x = cr - A-i + /3z. (8) 

In addition, 

/3z 

5 (z) := max y T s - \\\y\\i - — {y T Y^y) 
yen(Y) 2 

= max y s + mm y v (y Y 1 y) 

v en(Y) y v ■. |MU<A 2 vy J; 

/3z 

= min max (y T (s + v) (y T Y^y)) 

^ max (y T "- ^r(y Tyt y)) 

it : ll'U-sHocXA 5£R(y) 2 
1 T 

= min ~7}~ u Yu. 

u : I u— s || M < A 2pZ 

with the following relationship at optimum: 

y = ^Yu. (9) 

pz 

Putting all this together, we obtain the dual of problem (|7j): with?// := ijj+fi+^t 2 , and c := cr— A— i, 
we have 

■(/>' = min —u T Yu — f3 log z + -(c + /3z) 2 : z > 0, lln — slloo < A. 

u,z fjz 2 

Since /3 is small, we can avoid large numbers in the above, with the change of variable r = /3z: 

ib' - ft log @ = min -u T Yu - /31ogr+ \{c + t) 2 : r > 0, \\u - alU < A. (10) 

U.T T 2 
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Solving the sub-problem. Problem ( 10 1 can be further decomposed into two stages. 
First, we solve the box-constrained QP 

R? := min u T Yu : ||it - s|U < A, (11) 

u 

using a simple coordinate descent algorithm to exploit sparsity of Y, Without loss of generality, we 
consider the problem of updating the first coordinate of u. Partition u, Y and s as 

J l \ v _ ( 2/1 f ' \ „ _ ( s i 



Y 

U J ' \ y Y ) ' " V* 

Where, Y G r(«- 2 ) x («- 2 ), u,y 7 s £ R"~ 2 , y u s\ G R are all fixed, while rj G R is the variable. 
We obtain the subproblem 

min yi?7 2 + (2y T u)ri : \\r) - si\\ < \ (12) 

for which we can solve for -q analytically using the formula given below. 

^ if||si + ^||<A, yi >0, 
il= { Sl -X if -^»< Sl -A, yi >0orify T u>0, tfl = 0, (13) 
Sl + A if - > Sl + A, yi > 0orify T w<=0, tfl = 0. 



Next, we set r by solving the one-dimensional problem: 

R 2 1 

min ,SlogTH — (c + t) 2 . 

r>0 T 2 

The above can be reduced to a bisection problem over r, or by solving a polynomial equation of 
degree 3. 



Obtaining the primal variables. Once the above problem is solved, we can obtain the primal 
variables y, x, as follows. Using formula (|9jl, with j3z — r, we set y — ^Yu. For the diagonal 
element x, we use formula ([8]): x = c + t = <t — A — t + t. 

Algorithm summary. We summarize the above derivations in Algorithm [T] Notation: for any 
symmetric matrix A E R" xn ^ i e t A\i\ j denote the matrix produced by removing row i and column 
j. Let Aj denote column j (or row j) with the diagonal element Ajj removed. 

Convergence and complexity. Our algorithm solves DSPCA by first casting it to problem |6|, 
which is in the general form Q. Therefore, the convergence result from lfl4]l readily applies and 
hence every limit point that our block coordinate ascent algorithm converges to is the global opti- 
mizer. The simple coordinate descent algorithm solving problem (TT\ only involves a vector product 
and can take sparsity in Y easily. To update each column/row takes 0(n 2 ) and there are n such 
columns/rows in total. Therefore, our algorithm has a computational complexity of 0(Kn 3 ), where 
K is the number of sweeps through columns. In practice, K is fixed at a number independent of 
problem size (typically K = 5). Hence our algorithm has better dependence on the problem size 
compared to 0(n 4 y/\og n) required of the first order algorithm developed in JT]. 

Fig[T]shows that our algorithm converges much faster than the first order algorithm. On the left, both 
algorithms are run on a covariance matrix E = F T F with F Gaussian. On the right, the covariance 
matrix comes from a "spiked model" similar to that in [2 1, with £ = uu T + VV T /m, where u G R" 
is the true sparse leading eigenvector, with Card(it) = O.ln, V G R' iX ™ 1 is a noise matrix with 
Vij ~ Af(0, 1) and m is the number of observations. 



4 Numerical Examples 



In this section, we analyze two publicly available large data sets, the NYTimes news articles data 
and the PubMed abstracts data, available from the UCI Machine Learning Repository |16|. Both 
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Algorithm 1 Block Coordinate Ascent Algorithm 
Input: The covariance matrix E, and a parameter p > 0. 



SetX<°> = J 
repeat 

for j = 1 to n do 

Let X" -1 ' denote the current iterate. Solve the box-constrained quadratic program 

R 2 := min u T X { ^ ] u : \\u ~ EjH^ < A 

using the coordinate descent algorithm 
5: Solve the one-dimensional problem 

min — - p log r + \ (E ri - A - Tr X^ + rf 

T>0 T Z v7V7 

using a bisection method, or by solving a polynomial equation of degree 3. 
6: First set X^h = XyTy , and then set both X^ 's column j and row j using 



X u) = E,-,--A-TrX (i r 1) +T 



end for 

SetX (0) = X (n) 
until convergence 




Figure 1: Speed comparisons between Block Coordinate Ascent and First-Order 



text collections record word occurrences in the form of bag-of-words. The NYTtimes text collection 
contains 300, 000 articles and has a dictionary of 102, 660 unique words, resulting in a file of size 1 
GB. The even larger PubMed data set has 8, 200, 000 abstracts with 141, 043 unique words in them, 
giving a file of size 7.8 GB. These data matrices are so large that we cannot even load them into 
memory all at once, which makes even the use of classical PCA difficult. However with the pre- 
processing technique presented in Section 2 and the block coordinate ascent algorithm developed 
in Section 3, we are able to perform sparse PCA analysis of these data, also thanks to the fact that 
variances of words decrease drastically when we rank them as shown in Fig|2] Note that the feature 
elimination result only requires the computation of each feature's variance, and that this task is easy 
to parallelize. 

By doing sparse PCA analysis of these text data, we hope to find interpretable principal components 
that can be used to summarize and explore the large corpora. Therefore, we set the target cardinality 
for each principal component to be 5. As we run our algorithm with a coarse range of A to search for 
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Word Index x1Q 4 Word Index x1n * 

Figure 2: Sorted variances of 102,660 words in NYTimes (left) and 141,043 words in PubMed (right) 



a solution with the given cardinality, we might end up accepting a solution with cardinality close, 
but not necessarily equal to, 5, and stop there to save computational time. 

The top 5 sparse principal components are shown in Table[T]for NYTimes and in Table|2]for PubMed. 
Clearly the first principal component for NYTimes is about business, the second one about sports, 
the third about U.S., the fourth about politics and the fifth about education. Bear in mind that the 
NYTimes data from UCI Machine Learning Repository "have no class labels, and for copyright 
reasons no filenames or other document-level metadata" [ 16 1. The sparse principal components still 
unambiguously identify and perfectly correspond to the topics used by The New York Times itself to 
classify articles on its own website. 



Table 1: Words associated with the top 5 sparse principal components in NYTimes 
1st PC (6 words) 2nd PC (5 words) 3rd PC (5 words) 4th PC (4 words) 5th PC (4 words) 



million 


point 


official 


president 


school 


percent 


play 


government 


campaign 


program 


business 


team 


unitecLstates 


bush 


children 


company 


season 


u_s 


administration 


student 


market 


game 


attack 






companies 











After the pre-processing steps, it takes our algorithm around 20 seconds to search for a range of A 
and find one sparse principal component with the target cardinality (for the NYTimes data in our 
current implementation on a MacBook laptop with 2.4 GHz Intel Core 2 Duo processor and 2 GB 
memory). 



Table 2: Words associated with the top 5 sparse principal components in PubMed 
1st PC (5 words) 2nd PC (5 words) 3rd PC (5 words) 4th PC (4 words) 5th PC (4 words) 



patient 


effect 


human 


tumor 


year 


cell 


level 


expression 


mice 


infection 


treatment 


activity 


receptor 


cancer 


age 


protein 


concentration 


binding 


maligant 


children 


disease 


rat 




carcinoma 


child 



A surprising finding is that the safe feature elimination test, combined with the fact that word vari- 
ances decrease rapidly, enables our block coordinate ascent algorithm to work on covariance matri- 
ces of order at most n = 500, instead of the full order (n = 102660) covariance matrix for NYTimes, 
so as to find a solution with cardinality of around 5. In the case of PubMed, our algorithm only needs 
to work on covariance matrices of order at most n = 1000, instead of the full order (n = 141, 043) 
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covariance matrix. Thus, at values of the penalty parameter A that target cardinality of 5 commands, 
we observe a dramatic reduction in problem sizes, about 150 ~ 200 times smaller than the original 
sizes respectively. This motivates our conclusion that sparse PCA is in a sense, easier than PCA 
itself. 

5 Conclusion 

The safe feature elimination result, coupled with a fast block coordinate ascent algorithm, allows 
to solve sparse PCA problems for very large scale, real-life data sets. The overall method works 
especially well when the target cardinality of the result is small, which is often the case in applica- 
tions where interpretability by a human is key. The algorithm we proposed has better computational 
complexity, and in practice converges much faster than, the first-order algorithm developed in JT]. 
Our experiments on text data also show that the sparse PCA can be a promising approach towards 
summarizing and organizing a large text corpus. 
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